Variant analysis of SARS-CoV-2 strains with phylogenetic analysis and the Coronavirus Antiviral and Resistance Database

Aims: This study determined SARS-CoV-2 variations by phylogenetic and virtual phenotyping analyses. Materials & methods: Strains isolated from 143 COVID-19 cases in Turkey in April 2021 were assessed. Illumina NexteraXT library preparation kits were used for next-generation sequencing. Phylogenetic (neighbor-joining method) and virtual phenotyping analyses (Coronavirus Antiviral and Resistance Database [CoV-RDB] by Stanford University) were used for variant analysis. Results: B.1.1.7–1/2 (n = 103, 72%), B.1.351 (n = 5, 3%) and B.1.525 (n = 1, 1%) were identified among 109 SARS-CoV-2 variations by phylogenetic analysis and B.1.1.7 (n = 95, 66%), B.1.351 (n = 5, 4%), B.1.617 (n = 4, 3%), B.1.525 (n = 2, 1.4%), B.1.526-1 (n = 1, 0.6%) and missense mutations (n = 15, 10%) were reported by CoV-RDB. The two methods were 85% compatible and B.1.1.7 (alpha) was the most frequent SARS-CoV-2 variation in Turkey in April 2021. Conclusion: The Stanford CoV-RDB analysis method appears useful for SARS-CoV-2 lineage surveillance.

Since its first emergence in December 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of coronavirus disease 2019 (COVID-19), has accumulated many genetic variations due to its high mutation rate during replication. Most of these changes are not detrimental and therefore do not contribute to viral evolution [1]. These low effect or no effect changes, which are called silent amino acid changes, do not alter the basic structure and characteristics of the virus, while changes in the structural and nonstructural proteins of SARS-CoV-2 affect the viral antigenic phenotype and confer a fitness advantage. Consequently, emerging variants of SARS-CoV-2 may increase the rate of virus transmission, leading to hospitalizations and increased mortality rates in all age groups [1]. Therefore, for precise management of the ongoing COVID-19 pandemic, SARS-CoV-2 variations should be monitored.
The WHO classifies SARS-CoV-2 variants according to their genetic characteristics associated with transmissibility, increased virulence and ability to escape current diagnostic methods, vaccines and therapeutics, as variants reduce the neutralizing activity of certain monoclonal antibodies and polyclonal antibodies found in the sera of people recovering from infection [2,3]. While four different variants, defined as alpha (501Y.V1/B.1.1.7), beta (501Y.V2/B.1.351), gamma (501Y.V3/P.1) and delta (lineage B.1.617), are in the variant of concern (VOC) category, eta, iota, kappa and lambda have been designated as SARS-CoV-2 variants of interest (VOI) [2]. Detrimental variants of SARS-CoV-2 are largely caused by mutations in the spike glycoprotein, which mediates cell attachment and is the main target of neutralizing antibodies [3,4]. These variants continue to spread globally, posing a major public health threat worldwide. As of August 17th, 2021, cases of alpha, beta, gamma and delta have been reported in 190 countries, 138 countries, 82 countries and 148 countries, respectively [5].
The more opportunity a virus has to spread, the more it will evolve. Therefore, early detection of new cases and monitoring of SARS-CoV-2 genomic sequences for variations are important to predict the dominant virus circulating within the population, monitor how SARS-CoV-2 changes over time into new variations that might impact health and update the geographic distribution of variants [6,7]. While SARS-CoV-2 can be detected either by detection of viral nucleic acid, mainly by reverse transcriptase real-time polymerase chain reaction assay (RT-qPCR), or by detection of viral antigen or antibodies against these antigens [8], these tests cannot discriminate variants. Currently, PCR-based variant screening diagnostic assays are widely used in routine diagnostic settings for tracking these variants; however, gene analysis by whole or partial spike sequencing is the most accurate approach to identify variants associated with a specific trait or population [9]. Comprehensive analysis by next-generation sequencing (NGS) and bioinformatics for the ongoing genomic surveillance of SARS-CoV-2 enables the monitoring of viral spread, evolution and variation patterns worldwide in the fight against COVID-19 [10][11][12].
Phylogenetic analysis is widely viewed as the gold standard in genomic epidemiology [13][14][15]. However, with the rapid design of new virtual phenotyping technologies, identification of SARS-CoV-2 mutations can be achieved in a short time and at a low cost. Of these, the Coronavirus Antiviral and Resistance Database (CoV-RDB) by Stanford University, which is freely accessible [16], has been designed to promote comparisons between different candidate compounds against COVID-19, as well as rapid large-scale identification of SARS-CoV-2 mutations, since August 2020 [17]. CoV-RDB explores nucleotide sequences utilizing predetermined consensus SARS-CoV-2 sequences. According to the instructions from the database, sequences should be entered as plain text if only one sequence is analyzed and in FASTA format if more than one sequence is submitted. The upper limit is currently given by CoV-RDB as 100 sequences of ∼30,000 nucleotides each. Although CoV-RDB is currently available for clinical diagnosis, its variant diagnostic performance has not been well assessed. The objectives of this study were to reveal the genomic characterization of SARS-CoV-2 by NGS in Turkish patients infected with COVID-19 and to identify nucleotide variations by phylogenetic analysis and CoV-RDB virtual phenotyping.
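The input rules described above (plain text for a single sequence, FASTA for several, at most 100 sequences per submission) can be illustrated with a minimal Python sketch. The function names, sample names and line width are hypothetical and chosen only for illustration; the sketch prepares text for manual submission and does not call any CoV-RDB interface.

```python
from typing import Dict, List

# Upper limit per submission as quoted in the text above.
MAX_SEQUENCES_PER_SUBMISSION = 100


def to_fasta(records: Dict[str, str], line_width: int = 70) -> str:
    """Format {name: sequence} as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        for start in range(0, len(seq), line_width):
            lines.append(seq[start:start + line_width])
    return "\n".join(lines) + "\n"


def make_batches(records: Dict[str, str],
                 batch_size: int = MAX_SEQUENCES_PER_SUBMISSION) -> List[str]:
    """Split records into submission texts.

    A batch holding a single sequence is returned as plain text;
    a batch holding several sequences is returned as FASTA.
    """
    names = list(records)
    batches = []
    for start in range(0, len(names), batch_size):
        chunk = {n: records[n] for n in names[start:start + batch_size]}
        if len(chunk) == 1:
            batches.append(next(iter(chunk.values())))
        else:
            batches.append(to_fasta(chunk))
    return batches
```

Under these assumptions, a set of 143 sequences such as the one analyzed here would be split into two submissions of 100 and 43 sequences.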

Ethical approval
The ethical approval of this study was received from the Near East University Scientific Research Ethics Committee (decision number: 1383 NEU/2021/93).

Sample selection
In total, 143 SARS-CoV-2 strains isolated from SARS-CoV-2 infected cases in Kocaeli, Istanbul and Ankara in Turkey, at the beginning of April 2021, were included in the study. These strains were included in the study because they were screened with PCR variant screening kits and distinguished as probable SARS-CoV-2 variants.

SARS-CoV-2 real-time polymerase chain reaction
A fully automatic rotary nucleic acid magnetic particle extraction system, the Auto Extractor GeneRotex96 (Tianlong Science and Technology Co. Xi'an City, China), was used for SARS-CoV-2 RNA isolation from the nasal/oropharyngeal swab samples. For SARS-CoV-2 diagnosis, a routine RT-qPCR kit that targets two genes (BioSpeedy, Bioeksen Inc, Istanbul, Turkey), which is officially preferred by the Ministry of Health in pandemic conditions, was used.
Alignment of the resulting sequences was performed with MiSeq Reporter based on BWA software [19]. The sequenced data were aligned to the reference genome with BWA software, then processed with the BaseRecalibrator and ApplyBQSR programs recommended by the Genome Analysis Tool Kit (GATK; Broad Institute, Inc. MA, USA; open source under a BSD 3-clause "New or Revised" license) and recalibrated according to base-read quality. Variant calling was performed with the HaplotypeCaller program, and variants with a mapping quality below 50, a reading depth below 15 and a variant quality (QUAL) below 500 were eliminated from the analysis with the VariantFiltration program. The sequences of the samples for this region were created by introducing the detected mutations into the reference genome.
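The filtering and sequence-reconstruction steps described above can be sketched in Python as follows. This is an illustrative sketch, not the GATK pipeline used in the study: the `Variant` record and function names are hypothetical, only single-nucleotide substitutions are handled, and it is assumed that a variant is discarded when it fails any one of the three thresholds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Variant:
    pos: int      # 1-based position on the reference sequence
    ref: str      # reference base
    alt: str      # alternate base (single-nucleotide substitution)
    mq: float     # mapping quality
    depth: int    # read depth
    qual: float   # variant quality (QUAL)


def passes_filters(v: Variant, min_mq: float = 50.0,
                   min_depth: int = 15, min_qual: float = 500.0) -> bool:
    """Keep a variant only if it meets all three thresholds (assumed logic)."""
    return v.mq >= min_mq and v.depth >= min_depth and v.qual >= min_qual


def apply_variants(reference: str, variants: Iterable[Variant]) -> str:
    """Build a sample sequence by introducing passing variants into the reference."""
    seq = list(reference)
    for v in variants:
        if not passes_filters(v):
            continue
        if reference[v.pos - 1] != v.ref:
            raise ValueError(f"reference base mismatch at position {v.pos}")
        seq[v.pos - 1] = v.alt
    return "".join(seq)
```

For example, with a toy reference `ACGTACGT`, a C-to-T call at position 2 with high quality values is applied, whereas an A-to-G call at position 5 with a mapping quality of 40 is dropped.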

Phylogenetic analysis
The neighbor-joining method with Kimura 80 distances was applied together with sequences of all SARS-CoV-2 variants from the GenBank database by using CLC Sequence Viewer 8.0 software (Qiagen, CLC bio A/S, Aarhus, Denmark). Bootstrap support values were obtained from 1000 replicates in phylogenetic tree construction. Because of the large number of samples, the phylogenetic tree was constructed as circular and rooted. The consensus reference sequence of SARS-CoV-2, MN908947.3 (SARS-CoV-2 Wuhan-Hu-1), was used in this study and is available from the GenBank database [20].
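To make the distance measure concrete, the Kimura 80 (two-parameter) distance between two aligned sequences can be written as d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of sites showing transitions and transversions, respectively. The Python sketch below computes this pairwise distance only; it is not the CLC Sequence Viewer implementation and does not build the neighbor-joining tree. Skipping sites with gaps or ambiguous bases is an assumption made here for simplicity.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 80 distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with gaps or ambiguous bases are ignored.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

A matrix of such pairwise distances is the input that a neighbor-joining algorithm would then use to assemble the tree.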

Virtual phenotyping
CoV-RDB/SARS-CoV-2 Mutations Analysis by Stanford University [21] was used to explore the nucleotide sequences of the SARS-CoV-2 strains with the consensus SARS-CoV-2 reference sequence and identify SARS-CoV-2 mutations of the spike gene. The obtained SARS-CoV-2 variants/lineages were designated according to the WHO categorization and Centers for Disease Control and Prevention (CDC) SARS-CoV-2 Variant Classification and Definitions [22].
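The principle of comparing a sample sequence with a reference to list amino acid substitutions can be illustrated with a short Python sketch. It is a simplified illustration, not the CoV-RDB algorithm: it assumes that both coding sequences are already aligned in frame, have equal length and contain no insertions or deletions, and the function names are hypothetical. Substitutions are written in the usual form of reference residue, codon position and sample residue (e.g., N501Y).

```python
from itertools import product
from typing import List

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def amino_acid_substitutions(ref_cds: str, sample_cds: str) -> List[str]:
    """List substitutions (e.g. 'N2Y') between two aligned, equal-length CDSs."""
    if len(ref_cds) != len(sample_cds):
        raise ValueError("sequences must be aligned to the same length")
    ref_protein = translate(ref_cds.upper())
    sample_protein = translate(sample_cds.upper())
    return [
        f"{r}{pos}{s}"
        for pos, (r, s) in enumerate(zip(ref_protein, sample_protein), start=1)
        if r != s and "X" not in (r, s)
    ]
```

In the toy comparison of `ATGAATGGT` with `ATGTATGGC`, the second codon changes from AAT (N) to TAT (Y) and is reported, while the third codon changes from GGT to GGC, both encoding glycine, and is therefore silent.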

Results
One hundred and forty-three spike gene sequences were included in the study. The sequenced data were analyzed for variations using phylogenetic analysis and virtual phenotyping. Phylogenetic analysis can reveal detailed genomic characterization and evolutionary development of organisms. The SARS-CoV-2 variations obtained using the newly designed CoV-RDB were therefore compared with those obtained by phylogenetic analysis, which is regarded as the most accurate gene tree rooting method. Based on the variant classification, 109 (76%) and 122 (85%) SARS-CoV-2 variations were reported by phylogenetic analysis and CoV-RDB, respectively. Of the variations detected by CoV-RDB, 15 (10%) were missense mutations.
While the variations were obtained as lineages by phylogenetic analysis, CoV-RDB provided the mutation patterns and protein substitutions in addition to the lineages. Figure 1 illustrates different lineages obtained by the neighbor-joining method and Table 1 ( 144) were considered missense mutations, as they involve different amino acid changes for which the impact has not been well identified. The distribution of SARS-CoV-2 variations as lineages and amino acid mutations identified by phylogenetic analysis and using the CoV-RDB is given in Table 1. When the variations obtained by CoV-RDB were compared with the variations obtained by phylogenetic analysis, a similarity rate of 85% (n = 121) was observed between the two variant detection methods. The highest similarity was observed in the identification of B.1.351 (100%), followed by B.1.1.7 (92%), by the two methods. Similarity rates of SARS-CoV-2 variations by phylogenetic analysis and CoV-RDB are given in Table 2. Consequently, B.1.1.7 (alpha) was the most frequent SARS-CoV-2 variation in Turkey in April 2021.
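The proportions reported above follow from simple counts over the 143 sequences, as the following Python sketch shows. The exact procedure used to derive the similarity rate is not described in the text; the `percent_agreement` function below assumes a plain per-sample comparison of the two lineage calls and is given only as an illustration.

```python
from typing import Optional, Sequence


def percent(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


def percent_agreement(calls_a: Sequence[Optional[str]],
                      calls_b: Sequence[Optional[str]]) -> float:
    """Percentage of samples for which both methods give the same call.

    None stands for 'no variant lineage assigned' (assumed convention).
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both methods must be applied to the same samples")
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return percent(matches, len(calls_a))


TOTAL_SEQUENCES = 143
REPORTED_COUNTS = {
    "variations by phylogenetic analysis": 109,
    "variations by CoV-RDB": 122,
    "concordant results": 121,
    "missense mutations by CoV-RDB": 15,
}
ROUNDED = {label: round(percent(n, TOTAL_SEQUENCES))
           for label, n in REPORTED_COUNTS.items()}
```

With the counts given in the text, `ROUNDED` contains 76, 85, 85 and 10, which matches the percentages reported for phylogenetic analysis, CoV-RDB, the similarity rate and the missense mutations.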

Discussion
Continuous description of the genomic characterization of SARS-CoV-2 followed by variant analysis with powerful online tools is crucial, as it provides important information on changes in COVID-19 epidemiology, clinical disease outcomes and efficiency of diagnostics, vaccines and therapeutics, due to viral genome diversity [23]. In the current study, we sequenced the spike gene of SARS-CoV-2 strains of COVID-19-infected cases in Turkey in April 2021, as the S gene is key for SARS-CoV-2 surveillance to identify nucleotide variations [15,24,25,26]. In SARS-CoV-2 spike genomes, we reported 76% and 85% nucleotide variations by phylogenetic analysis and CoV-RDB analysis, respectively.
The genomic findings revealed that although two major VOCs, including B.  [29] and by , 148 countries confirmed the presence of the delta variant [30]. Tracking changes in the SARS-CoV-2 spike reveals that SARS-CoV-2 variations should be monitored continuously by genome sequence analysis in Turkey and in other countries.
During the pandemic, it is important to identify variants as quickly as possible. In this study, we evaluated the sequenced data for SARS-CoV-2 variations by two different variant detection methods to better understand the diagnostic power of tools commonly used in variant analysis. As there are no data in the literature that reflect this comparison, we evaluated the detection performance of a virtual phenotyping method against the gold standard method, phylogenetic analysis. The findings showed that the two sequence analysis methods were 85% compatible. Interestingly, we observed the highest similarity in the identification of B.1.351 (100%), followed by B.1.1.7 (92%), by the two methods. The similarity of the results suggests that the CoV-RDB, which provides more rapid sequence exploration, may also be an appropriate alternative approach for determining SARS-CoV-2 mutations.
Although spike sequencing and analysis are used as the gold standard for accurate genomic surveillance, SARS-CoV-2 PCR variant screening kits were used before NGS to distinguish particular SARS-CoV-2 variants circulating in Turkey among all SARS-CoV-2 PCR-positive cases. The current findings showed that 24% and 15% of the strains were identified as wildtype by phylogenetic analysis and CoV-RDB, respectively, although these strains had been classified as SARS-CoV-2 variants by multiplex PCR kits. Durner et al. demonstrated the feasibility of Y501 variant-specific PCR for fast and reliable detection of UK SARS-CoV-2 variants in routine diagnosis, and their suspected variant was confirmed by the reference laboratory [31]. Similarly, Zhao et al. reported both the specificity and the sensitivity of SARS-CoV-2 variant detection based on multiplex PCR-matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF-MS) as 100% [32]. In another study, the positive and negative predictive values were 100% for an RT-qPCR assay screening for the spike N501Y mutation [33]. According to the current findings, variant screening PCR kits could be a good choice for preselecting variant strains for NGS analysis, which saves time and cost, especially for developing countries.
To point out the limitations of this study, our genomic analysis identifies variations of cases infected with COVID-19 only in the provinces of Istanbul, Kocaeli and Ankara in April 2021. To reveal the genomic variations of SARS-CoV-2 in the whole of Turkey, more cases from many different cities should be included and these cases should be investigated periodically to provide updated surveillance.

Conclusion
In the COVID-19 pandemic, variant emergence is possible and may be rapid. Therefore, SARS-CoV-2 strains should be constantly monitored. Phylogenetic analysis and Stanford CoV-RDB analysis methods seem useful for this surveillance.

Summary points
• Genomic characterization of SARS-CoV-2 allows the description of important information on phenotypic characteristics, including disease transmission, disease severity, diagnostic escape and immune escape due to emerging new coronavirus variants.
• Next-generation sequencing is widely used for genomic characterization of SARS-CoV-2, followed by variant analysis with phylogenetic analysis.
• With the rapid design of new virtual phenotyping technologies, identification of SARS-CoV-2 mutations can also be achieved in a short time and at low cost.
• B.1.1.7 (alpha) was the most frequent SARS-CoV-2 variation in Turkey in April 2021.
• The Coronavirus Antiviral and Resistance Database (CoV-RDB) by Stanford University, which is freely accessible at https://covdb.stanford.edu/, has been designed to promote comparisons between different candidate compounds against COVID-19, as well as rapid large-scale identification of SARS-CoV-2 mutations, since August 2020.
• The current findings showed that both sequence analysis methods were 85% compatible.
• Phylogenetic analysis and Stanford CoV-RDB analysis methods seem useful for tracking SARS-CoV-2 strains.

Financial & competing interests disclosure
The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
No writing assistance was utilized in the production of this manuscript. future science group 10.2217/cer-2021-0208